Abstract

This  note  reviews  basic  techniques  of  linear  path  analysis  and  demonstrates, 
using  simple  examples,  how  causal  phenomena  of  non-trivial  character  can  be 
understood,  exemplified  and  analyzed  using  diagrams  and  a  few  algebraic  steps. 
The  techniques  allow  for  swift  assessment  of  how  various  features  of  the  model 
impact the phenomenon under investigation. The phenomena covered include Simpson’s paradox,
case-control bias, selection bias, collider bias, reverse regression, bias amplification,
near instruments, and measurement errors.


1  Introduction 

Many  concepts  and  phenomena  in  causal  analysis  were  first  detected,  quantified  and 
exemplified  in  linear  structural  equation  models  (SEM)  before  they  were  understood 
in full generality and applied to nonparametric problems. Linear SEMs can serve as a
“microscope” for causal analysis; they provide simple and visual representation of the
causal assumptions in the model and often enable us to derive closed-form expressions
for quantities of interest which, in turn, can be used to assess how various aspects of
the model affect the phenomenon under investigation. Likewise, linear models can be
used  to  test  general  hypotheses  and  to  generate  counter-examples  to  over-ambitious 
conjectures. 

Despite  their  ubiquity,  however,  techniques  for  using  linear  models  in  that  capacity 
have  all  but  disappeared  from  the  main  SEM  literature,  where  they  have  been  replaced 
by  matrix  algebra  on  the  one  hand  and  software  packages  on  the  other.  Very  few 
analysts  today  are  familiar  with  traditional  methods  of  path  tracing  (Wright,  1921; 
Duncan,  1975;  Kenny,  1979;  Heise,  1975)  which,  for  small  problems,  can  provide  both 
intuitive  insight  and  easy  derivations  using  elementary  algebra. 

This note attempts to fill this void by introducing the basic techniques of path
analysis to modern researchers, and demonstrating, using simple examples, how concepts
and issues in modern causal analysis can be understood and analyzed in SEM.
These will include: Simpson’s paradox, case-control bias, selection bias, collider bias,
reverse regression, bias amplification, near instruments, measurement errors, and
more.


2  Preliminaries 


2.1  Covariance,  regression,  and  correlation 

We start with the standard definitions of variance and covariance for a pair of variables
$X$ and $Y$. The variance of $X$ is defined as

$$\sigma_x^2 = E[X - E(X)]^2$$

and measures the degree to which $X$ deviates from its mean $E(X)$.

The covariance of $X$ and $Y$ is defined as

$$\sigma_{xy} = E\big[(X - E(X))(Y - E(Y))\big]$$

and measures the degree to which $X$ and $Y$ covary.

Associated with the covariance, we define two other measures of association: (1)
the regression coefficient $\beta_{yx}$ and (2) the correlation coefficient $\rho_{yx}$. The relationships
among the three are given by the following equations:

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \tag{1}$$

$$\beta_{yx} = \frac{\sigma_{xy}}{\sigma_x^2} = \frac{\sigma_y}{\sigma_x}\,\rho_{xy} \tag{2}$$

We note that $\rho_{xy} = \rho_{yx}$ is dimensionless and confined to the interval $-1 \le \rho_{xy} \le 1$.
The regression coefficient, $\beta_{yx}$, represents the slope of the least-square-error
line in the prediction of $Y$ given $X$:

$$\beta_{yx} = \frac{\partial}{\partial x} E(Y \mid X = x)$$
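
The three measures can be compared on a small data set. The sketch below is illustrative only: the five data points and the helper name `cov` are invented for this example, and population (divide-by-n) moments are assumed throughout.

```python
from statistics import mean

def cov(a, b):
    """Population covariance E[(A - E(A))(B - E(B))]."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

# Toy data, invented for this illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

var_x, var_y, cov_xy = cov(x, x), cov(y, y), cov(x, y)
rho_xy = cov_xy / (var_x ** 0.5 * var_y ** 0.5)  # Eq. (1)
beta_yx = cov_xy / var_x                         # Eq. (2): slope of Y on X
beta_xy = cov_xy / var_y                         # slope of X on Y

print(round(beta_yx, 4), round(beta_xy, 4), round(rho_xy, 4))
```

The two regression slopes differ, while their product equals the squared correlation, which is symmetric in $X$ and $Y$.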


2.2  Partial  correlations  and  regressions 

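Many questions in causal analysis concern the change in a relationship between $X$
and $Y$ conditioned on a given set $Z$ of variables. The easiest way to define this change
is through the partial regression coefficient $\beta_{yx\cdot z}$, which is given by

$$\beta_{yx\cdot z} = \frac{\partial}{\partial x} E(Y \mid X = x, Z = z)$$

In words, $\beta_{yx\cdot z}$ is the slope of the regression line of $Y$ on $X$ when we consider only
cases for which $Z = z$.

The partial correlation coefficient $\rho_{xy\cdot z}$ can be defined by normalizing $\beta_{yx\cdot z}$:

$$\rho_{xy\cdot z} = \beta_{yx\cdot z}\,\sigma_{x\cdot z} / \sigma_{y\cdot z}$$

A well-known result in regression analysis (Cramer, 1946) permits us to express
$\rho_{yx\cdot z}$ recursively in terms of pair-wise correlation coefficients. When $Z$ is a singleton,
this reduction reads:

$$\rho_{yx\cdot z} = \frac{\rho_{yx} - \rho_{yz}\rho_{xz}}{\left[(1 - \rho_{yz}^2)(1 - \rho_{xz}^2)\right]^{1/2}} \tag{3}$$

Accordingly, we can also express $\beta_{yx\cdot z}$ and $\sigma_{yx\cdot z}$ in terms of pair-wise relationships,
which gives:

$$\sigma_{yx\cdot z} = \sqrt{\sigma_{xx} - \sigma_{xz}^2/\sigma_z^2}\;\sqrt{\sigma_{yy} - \sigma_{yz}^2/\sigma_z^2}\;\rho_{yx\cdot z} \tag{4}$$

$$\sigma_{yx\cdot z} = \sigma_x^2\left[\beta_{yx} - \beta_{yz}\beta_{zx}\right] = \sigma_{yx} - \frac{\sigma_{yz}\sigma_{zx}}{\sigma_z^2} \tag{5}$$

$$\beta_{yx\cdot z} = \frac{\beta_{yx} - \beta_{yz}\beta_{zx}}{1 - \beta_{zx}^2\,\sigma_x^2/\sigma_z^2} = \frac{\sigma_{yx} - \sigma_{yz}\sigma_{zx}/\sigma_z^2}{\sigma_x^2 - \sigma_{xz}^2/\sigma_z^2} = \frac{\sigma_y}{\sigma_x}\,\frac{\rho_{yx} - \rho_{yz}\cdot\rho_{zx}}{1 - \rho_{zx}^2} \tag{6}$$

Note that none of these conditional associations depends on the level $z$ at which
we condition variable $Z$; this is one of the features that makes linear analysis easy to
manage and, at the same time, limited in the spectrum of relationships it can capture.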
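
As a numerical sanity check, the three equivalent expressions for the partial regression coefficient in Eq. (6) can be evaluated on one covariance matrix. This is a sketch under assumed inputs: the function name `partial_slope_forms` and the covariance values are hypothetical, chosen only so that the matrix is positive definite.

```python
def partial_slope_forms(s_xx, s_yy, s_zz, s_xy, s_xz, s_yz):
    """Return beta_{yx.z} computed by three equivalent expressions of Eq. (6)."""
    # Pair-wise regression coefficients.
    b_yx, b_yz, b_zx = s_xy / s_xx, s_yz / s_zz, s_xz / s_xx
    form_beta = (b_yx - b_yz * b_zx) / (1 - b_zx ** 2 * s_xx / s_zz)
    # Covariance form.
    form_sigma = (s_xy - s_yz * s_xz / s_zz) / (s_xx - s_xz ** 2 / s_zz)
    # Correlation form.
    sd_x, sd_y, sd_z = s_xx ** 0.5, s_yy ** 0.5, s_zz ** 0.5
    r_yx = s_xy / (sd_x * sd_y)
    r_yz = s_yz / (sd_y * sd_z)
    r_zx = s_xz / (sd_x * sd_z)
    form_rho = (sd_y / sd_x) * (r_yx - r_yz * r_zx) / (1 - r_zx ** 2)
    return form_beta, form_sigma, form_rho

# Hypothetical covariances (positive definite), not taken from any data set.
forms = partial_slope_forms(s_xx=4.0, s_yy=9.0, s_zz=1.0,
                            s_xy=3.0, s_xz=1.0, s_yz=1.5)
print(forms)
```

None of the three expressions involves the conditioning level $z$, which is the linearity property noted above.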


2.3  Path  diagrams  and  structural  equation  models 

A  linear  structural  equation  model  (SEM)  is  a  system  of  linear  equations  among 
a  set  V  of  variables,  such  that  each  variable  appears  on  the  left  hand  side  of  at 
most  one  equation.  For  each  equation,  the  variable  on  its  left  hand  side  is  called 
the  dependent  variable,  and  those  on  the  right  hand  side  are  called  independent  or 
explanatory  variables.  For  example,  the  equation  below 

$$Y = \alpha X + \beta Z + U_Y \tag{7}$$

declares $Y$ as the dependent variable, $X$ and $Z$ as explanatory variables, and $U_Y$ as an
“error” or “disturbance” term, representing all factors omitted from $V$ that, together
with $X$ and $Z$, determine the value of $Y$. A structural equation should be interpreted
as an assignment process, i.e., to determine the value of $Y$, nature consults the value
of variables $X$, $Z$ and $U_Y$ and, based on their linear combination in (7), assigns a
value to $Y$.

This interpretation renders the equality sign in Eq. (7) non-symmetrical, since the
values of $X$ and $Z$ are not determined by inverting (7) but by other equations, for
example,

$$X = \gamma Z + U_X \tag{8}$$

$$Z = U_Z \tag{9}$$

Figure 1: [path diagrams (a) and (b) for Eqs. (7)-(9); in (b), a double arrow connects $U_X$ and $U_Y$]

The directionality of this assignment process is captured by a path diagram, in
which the nodes represent variables, and the arrows represent the non-zero coefficients
in the equations. The diagram in Fig. 1(a) represents the SEM equations of (7)-(9)
and the assumption of zero correlations between the $U$ variables,

$$\sigma_{U_X U_Y} = \sigma_{U_X U_Z} = \sigma_{U_Z U_Y} = 0$$

The diagram in Fig. 1(b), on the other hand, represents equations (7)-(9) together with
the assumption

$$\sigma_{U_X U_Z} = \sigma_{U_Z U_Y} = 0$$

while $\sigma_{U_X U_Y} = C_{XY}$ remains undetermined.

The coefficients $\alpha$, $\beta$, and $\gamma$ are called path coefficients, or structural parameters,
and they carry causal information. For example, $\alpha$ stands for the change in $Y$ induced
by raising $X$ one unit, while keeping all other variables constant (see footnote 1).

The assumption of linearity makes this change invariant to the levels at which we
keep those other variables constant, including the error variables; a property called
“effect homogeneity.” Since errors (e.g., $U_X$, $U_Y$, $U_Z$) capture variations among
individual units (i.e., subjects, samples, or situations), effect homogeneity amounts to
claiming that all units react equally to any treatment, which may exclude applications
with profoundly heterogeneous subpopulations.

2.4  Wright’s  path-tracing  rules 

In 1921, the geneticist Sewall Wright developed an ingenious method by which the
covariance $\sigma_{xy}$ of any two variables can be determined swiftly, by mere inspection of
the diagram (Wright, 1921). Wright’s method consists of equating the (standardized; see footnote 2)
covariance $\sigma_{xy} = \rho_{xy}$ between any pair of variables to the sum of products of path
coefficients and error covariances along all d-connected paths between $X$ and $Y$. A
path is d-connected if it does not traverse any collider (i.e., head-to-head arrows, as
in $X \to Y \leftarrow Z$).

Footnote 1: Readers familiar with do-calculus (Pearl, 1995) can interpret $\alpha$ as the experimental slope
$\alpha = \frac{\partial}{\partial x} E[Y \mid do(x), do(z)]$,
while those familiar with counterfactual logic can write $\alpha = \frac{\partial}{\partial x} Y_{xz}(u)$. The latter implies the former,
and the two coincide in linear models, where causal effects are homogeneous (i.e., unit-independent).

Footnote 2: Standardized parameters refer to systems in which (without loss of generality) all variables are
normalized to have zero mean and unit variance, which significantly simplifies the algebra.

For example, in Fig. 1(a), the standardized covariance $\sigma_{xy}$ is obtained by summing
$\alpha$ with the product $\beta\gamma$, thus yielding $\sigma_{xy} = \alpha + \beta\gamma$, while in Fig. 1(b) we get:
$\sigma_{xy} = \alpha + \beta\gamma + C_{XY}$. Note that for the pair $(X, Z)$, we get $\sigma_{xz} = \gamma$ since the path
$X \to Y \leftarrow Z$ is not d-connected.

The method above is valid for standardized variables, namely, variables normalized
to have zero mean and unit variance. For non-standardized variables the method
needs to be modified slightly, multiplying the product associated with a path $p$ by the
variance of the variable that acts as the “root” for path $p$. For example, for Fig. 1(a)
we have $\sigma_{xy} = \sigma_x^2\alpha + \sigma_z^2\beta\gamma$, since $X$ serves as the root for path $X \to Y$ and $Z$ serves
as the root for $X \leftarrow Z \to Y$. In Fig. 1(b), however, we get $\sigma_{xy} = \sigma_x^2\alpha + \sigma_z^2\beta\gamma + C_{XY}$,
where the double arrow $U_X \leftrightarrow U_Y$ serves as its own root.
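
The path-tracing values for the standardized model of Fig. 1(a) can be cross-checked by brute force: write every variable as a weighted sum of the independent error terms and compute covariances directly. The sketch below assumes the structure of Eqs. (7)-(9) with uncorrelated errors; the function name `covariances` and the parameter values are chosen for illustration only.

```python
def covariances(alpha, beta, gamma):
    """Covariances implied by Eqs. (7)-(9) when X, Y, Z all have unit variance."""
    # Error variances chosen so that every observed variable has variance 1.
    var_uz = 1.0
    var_ux = 1.0 - gamma ** 2
    var_uy = 1.0 - (alpha ** 2 + beta ** 2 + 2 * alpha * beta * gamma)
    variances = (var_uz, var_ux, var_uy)
    # Each variable as weights on the independent errors (U_Z, U_X, U_Y).
    z = (1.0, 0.0, 0.0)                      # Z = U_Z
    x = (gamma, 1.0, 0.0)                    # X = gamma*Z + U_X
    u_y = (0.0, 0.0, 1.0)
    y = tuple(alpha * xi + beta * zi + ui    # Y = alpha*X + beta*Z + U_Y
              for xi, zi, ui in zip(x, z, u_y))

    def cov(a, b):
        return sum(ai * bi * v for ai, bi, v in zip(a, b, variances))

    return {"xx": cov(x, x), "yy": cov(y, y),
            "xy": cov(x, y), "xz": cov(x, z), "yz": cov(y, z)}

s = covariances(alpha=0.3, beta=0.4, gamma=0.5)
print(s)
```

For these values the direct computation agrees with what path tracing gives by inspection: the $X$-$Y$ covariance equals $\alpha$ plus the product of the two coefficients on the path through $Z$.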


2.5  Reading  partial  correlations  from  path  diagrams 

The reduction from partial to pair-wise correlations summarized in equations (4)-(6),
when combined with Wright’s path-tracing rules, permits us to extend the latter
so as to read partial correlations directly from the diagram. For example, to read
the partial regression coefficient $\beta_{yx\cdot z}$, we start with a standardized model where all
variances are unity (hence $\sigma_{xy} = \rho_{xy} = \beta_{xy}$), and apply Eq. (6) with $\sigma_x = \sigma_z = 1$ to
get:

$$\beta_{yx\cdot z} = \frac{\sigma_{yx} - \sigma_{yz}\sigma_{zx}}{1 - \sigma_{zx}^2} \tag{10}$$

At this point, each pair-wise covariance can be computed from the diagram through
path-tracing and, substituted in (10), yields an expression for the partial regression
coefficient $\beta_{yx\cdot z}$.

To illustrate, the pair-wise covariances for Fig. 1(a) are:

$$\sigma_{yx} = \alpha + \beta\gamma \tag{11}$$

$$\sigma_{xz} = \gamma \tag{12}$$

$$\sigma_{yz} = \beta + \alpha\gamma \tag{13}$$

Substituting in (10), we get

$$\beta_{yx\cdot z} = \frac{(\alpha + \beta\gamma) - (\beta + \gamma\alpha)\gamma}{1 - \gamma^2} = \frac{\alpha(1 - \gamma^2)}{1 - \gamma^2} = \alpha \tag{14}$$

Indeed, we know that, for a confounding-free model like Fig. 1(a), the direct effect
$\alpha$ is identifiable and given by the partial regression coefficient $\beta_{yx\cdot z}$. Repeating the
same calculation on the model of Fig. 1(b), where $\sigma_{yx} = \alpha + \beta\gamma + C_{XY}$ and the numerator of (10) becomes $\alpha(1 - \gamma^2) + C_{XY}$, yields:

$$\beta_{yx\cdot z} = \alpha + \frac{C_{XY}}{1 - \gamma^2}$$

leaving $\alpha$ non-identifiable.
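
The substitution into Eq. (10) is mechanical and can be sketched in a few lines. The helper names `partial_slope` and `slope_given_z` are hypothetical, as are the parameter values; the covariances plugged in are the path-traced ones for Fig. 1, with an optional error covariance `c_xy` for the variant in Fig. 1(b).

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

def slope_given_z(alpha, beta, gamma, c_xy=0.0):
    """beta_{yx.z} for Fig. 1; c_xy is the error covariance of Fig. 1(b)."""
    s_yx = alpha + beta * gamma + c_xy   # paths X -> Y, X <- Z -> Y, U_X <-> U_Y
    s_yz = beta + alpha * gamma          # paths Z -> Y, Z -> X -> Y
    s_zx = gamma                         # path Z -> X
    return partial_slope(s_yx, s_yz, s_zx)

no_confounding = slope_given_z(0.3, 0.4, 0.5)
with_error_cov = slope_given_z(0.3, 0.4, 0.5, c_xy=0.1)
print(no_confounding, with_error_cov)
```

With uncorrelated errors the adjusted slope returns the structural coefficient; once the error covariance is non-zero it no longer does.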

Armed with the ability to read partial regressions, we are now prepared to demonstrate
some peculiarities of causal analysis.


3  The  Microscope  at  Work:  Examples  and  their 
Implications 

3.1  Simpson’s  paradox 

Simpson’s paradox describes a phenomenon whereby an association between two variables
reverses sign upon conditioning on a third variable, regardless of the value taken
by the latter. The history of this paradox and the reasons it evokes surprise and disbelief
are described in Chapter 6 of (Pearl, 2009a).

The conditions under which association reversal appears in linear models can be
seen directly in Fig. 1(a). Comparing (11) and (14) we obtain

$$\beta_{yx} = \alpha + \beta\gamma, \qquad \beta_{yx\cdot z} = \alpha$$

Thus, if $\alpha$ has a different sign from $\beta\gamma$ and $|\beta\gamma| > |\alpha|$, the regression of
$Y$ on $X$, $\beta_{yx}$, changes sign upon conditioning on $Z = z$, for every $z$. The magnitude
of the change depends on the product $\beta\gamma$, which measures the extent to which $X$ and
$Y$ are confounded in the model.
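
A concrete reversal is easy to produce. The sketch below uses hypothetical coefficients for Fig. 1(a), chosen so that the confounding product outweighs the direct effect and has the opposite sign; `partial_slope` is an illustrative helper implementing Eq. (10).

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

# Hypothetical standardized coefficients: X -> Y, Z -> Y, Z -> X.
alpha, beta, gamma = 0.3, -0.8, 0.7

marginal = alpha + beta * gamma                       # slope of Y on X
conditional = partial_slope(marginal,                 # slope of Y on X given Z
                            beta + alpha * gamma,
                            gamma)
print(marginal, conditional)
```

The unadjusted slope is negative while the slope within every stratum of $Z$ is positive, which is the sign reversal described above.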


3.2  Conditioning  on  intermediaries  and  their  proxies 

Conventional wisdom informs us that, in estimating the effect of one variable on
another, one should not adjust for any covariate that lies on the pathway between
the two (Cox, 1958). It took decades for epidemiologists to discover that a similar
prohibition applies to proxies of intermediaries (Weinberg, 1993). The amount of
bias introduced by such adjustment can be assessed from Fig. 2.

Figure 2: [path diagram: $X \to Z \to Y$ with coefficients $\alpha$ and $\beta$; $Z \to W$ with coefficient $\gamma$]

Here, the effect of $X$ on $Y$ is simply $\alpha\beta$, as is reflected by the regression slope
$\beta_{yx} = \alpha\beta$. If we condition on the intermediary $Z$, the regression slope vanishes, since
the equality $\sigma_{yx} = \alpha\beta = \sigma_{yz}\sigma_{zx}$ renders $\beta_{yx\cdot z}$ zero in Eq. (10). If we condition on a
proxy $W$ of $Z$, Eq. (10) yields

$$\beta_{yx\cdot w} = \frac{\beta_{yx} - \beta_{yw}\beta_{wx}}{1 - \beta_{wx}^2} = \frac{\alpha\beta - \beta\gamma\cdot\alpha\gamma}{1 - \alpha^2\gamma^2} = \frac{\alpha\beta(1 - \gamma^2)}{1 - \alpha^2\gamma^2} \tag{15}$$

which unveils a bias of size

$$\beta_{yx\cdot w} - \alpha\beta = \frac{\alpha\beta\gamma^2(\alpha^2 - 1)}{1 - \alpha^2\gamma^2}$$

As expected, the bias disappears for $\gamma = 0$ and intensifies as $\gamma$ approaches 1, where conditioning
on $W$ amounts to suppressing all variations in $Z$.

Speaking of suppressing variations, the model in Fig. 3 may carry some surprise.

Figure 3: [path diagram: $X \to Z \to Y$ with coefficients $\alpha$ and $\beta$; $W \to Z$ with coefficient $\gamma$]

Conditioning on $W$ in this model also suppresses variations in $Z$, especially for high
$\gamma$ and, yet, it introduces no bias whatsoever; since $W$ and $X$ are uncorrelated ($\sigma_{wx} = 0$), the partial regression slope is (Eq. 10):

$$\beta_{yx\cdot w} = \frac{\sigma_{yx} - \sigma_{yw}\sigma_{wx}}{1 - \sigma_{wx}^2} = \frac{\alpha\beta - 0}{1 - 0} = \alpha\beta \tag{16}$$

which is precisely the causal effect of $X$ on $Y$. It seems as though no matter how
tightly we “clamp” $Z$ by controlling $W$, the causal effect of $X$ on $Y$ remains unaltered.
Appendix I explains this counter-intuitive result.
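
The contrast between the two models can be tabulated numerically. This is a sketch under assumed structures: a chain from X through Z to Y, with W either a child of Z (the proxy case) or a parent of Z; the function names and parameter values are illustrative, not part of the original analysis.

```python
def partial_slope(s_yx, s_yw, s_wx):
    """Eq. (10) with W as the conditioning variable (unit variances)."""
    return (s_yx - s_yw * s_wx) / (1.0 - s_wx ** 2)

def slope_given_proxy(alpha, beta, gamma):
    """W is a child of Z: cov(Y,W) = beta*gamma, cov(W,X) = alpha*gamma."""
    return partial_slope(alpha * beta, beta * gamma, alpha * gamma)

def slope_given_cause_of_z(alpha, beta, gamma):
    """W is a parent of Z and independent of X: cov(W,X) = 0."""
    return partial_slope(alpha * beta, beta * gamma, 0.0)

alpha, beta = 0.5, 0.6
for gamma in (0.0, 0.4, 0.8):
    print(gamma,
          round(slope_given_proxy(alpha, beta, gamma), 4),
          round(slope_given_cause_of_z(alpha, beta, gamma), 4))
```

In the proxy case the adjusted slope shrinks from the full effect toward zero as the proxy becomes more faithful; in the second case it stays at the full effect for every value of the coefficient linking W and Z.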

3.3  Case-control  bias 

In the last section, we explained the bias introduced by conditioning on an intermediate
variable (or its proxy) as a restriction on the flow of information between $X$
and $Y$. This explanation is not entirely satisfactory, as can be seen from the model of
Fig. 4. Here, $Z$ is not on the pathway between $X$ and $Y$, and one might surmise that
no bias would be introduced by conditioning on $Z$, but analysis dictates otherwise.

Figure 4: [path diagram: $X \to Y \to Z$ with coefficients $\alpha$ and $\delta$]

Path tracing combined with Eq. (10) gives:

$$\beta_{yx\cdot z} = \frac{\sigma_{yx} - \sigma_{yz}\sigma_{zx}}{1 - \sigma_{xz}^2} = \frac{\alpha - \delta^2\alpha}{1 - \alpha^2\delta^2} = \frac{\alpha(1 - \delta^2)}{1 - \alpha^2\delta^2} \tag{17}$$

and yields the bias

$$\beta_{yx\cdot z} - \alpha = \frac{\alpha\delta^2(\alpha^2 - 1)}{1 - \alpha^2\delta^2} \tag{18}$$

This bias reflects what epidemiologists call “case-control bias” (Robins, 2001), which
occurs when only patients for whom the outcome $Y$ is evidenced (e.g., a complication
of a disease) are counted in the database. An intuitive explanation of this bias
(invoking virtual colliders) is given in (Pearl, 2009a, p. 339). In contrast, conditioning
on a proxy of the explanatory variable $X$, as in Fig. 5, introduces no bias, since

Figure 5: [path diagram: $X \to Y$ with coefficient $a$; $X \to Z$ with coefficient $b$]

$$\beta_{yx\cdot z} = \frac{a - (ab)b}{1 - b^2} = a \tag{19}$$

This can also be deduced from the conditional independence $Z \perp\!\!\!\perp Y \mid X$, which is
implied by the diagram in Fig. 5, but not in Fig. 4. However, to assess the size of the
induced bias, as we did in Eq. (18), requires an algebraic analysis of path tracing.
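
The two cases can be compared side by side. The sketch assumes unit-variance variables and the structures described in the text (Z a child of the outcome versus Z a child of the explanatory variable); the helper names and the numbers are hypothetical.

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

def slope_given_child_of_y(alpha, delta):
    """Chain X -> Y -> Z: cov(Y,Z) = delta, cov(Z,X) = alpha*delta."""
    return partial_slope(alpha, delta, alpha * delta)

def slope_given_child_of_x(a, b):
    """Fork Y <- X -> Z: cov(Y,Z) = a*b, cov(Z,X) = b."""
    return partial_slope(a, a * b, b)

case_control_bias = slope_given_child_of_y(0.5, 0.8) - 0.5
proxy_of_x_bias = slope_given_child_of_x(0.5, 0.8) - 0.5
print(case_control_bias, proxy_of_x_bias)
```

Conditioning on a child of the outcome shrinks the slope toward zero, whereas conditioning on a child of the explanatory variable leaves it unchanged.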


3.4  Sample  Selection  Bias 

The two examples above are special cases of a more general phenomenon called “selection
bias” which occurs when samples are preferentially selected to the data set,
depending on the values of some variables in the model (Bareinboim and Pearl, 2012;
Daniel et al., 2011; Geneletti et al., 2009; Pearl, 2012). In Fig. 6, for example, if
$Z = 1$ represents inclusion in the data set, and $Z = 0$ exclusion, the selection decision
is shown to be a function of both $X$ and $Y$.

Figure 6: [path diagram: $Z$ is a child of both $X$ (coefficient $b$) and $Y$ (coefficient $c$)]

Since inclusion ($Z = 1$) amounts to
conditioning on $Z$, we may ask what the regression of $Y$ on $X$ is in the observed
data, $\beta_{yx\cdot z}$, compared with the regression in the entire population, $\beta_{yx} = \alpha + \gamma$.
Applying our path-tracing analysis in (10) we get:

$$\beta_{yx\cdot z} = \frac{\sigma_{yx} - \sigma_{yz}\sigma_{zx}}{1 - \sigma_{zx}^2} = \frac{(\alpha + \gamma) - [(\alpha + \gamma)b + c]\,[b + (\alpha + \gamma)c]}{1 - [b + (\alpha + \gamma)c]^2} = \frac{(\alpha + \gamma)[1 - b^2 - c^2] - bc\,[1 + (\alpha + \gamma)^2]}{1 - [b + (\alpha + \gamma)c]^2} \tag{20}$$

We see that a substantial bias may result from conditioning on $Z$, persisting even
when $X$ and $Y$ are not correlated, namely, when $\sigma_{xy} = \alpha + \gamma = 0$. Note also that
the bias disappears for $c = 0$, as in Fig. 5, but not for $b = 0$, which returns us to the
case-control model of Fig. 4.
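
The limiting cases mentioned above can be checked by evaluating Eq. (10) directly on the path-traced covariances. In this sketch `s` stands for the population slope of Y on X, and `b`, `c` for the coefficients on the arrows from X and from Y into the selection variable; the function names and values are assumptions made for illustration.

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

def slope_in_selected_sample(s, b, c):
    """Slope of Y on X after conditioning on a selection variable Z <- X, Z <- Y."""
    s_yz = s * b + c      # Y -> Z, plus Y ~ X -> Z
    s_zx = b + s * c      # X -> Z, plus X ~ Y -> Z
    return partial_slope(s, s_yz, s_zx)

uncorrelated = slope_in_selected_sample(s=0.0, b=0.5, c=0.4)   # X, Y uncorrelated
no_y_arrow = slope_in_selected_sample(s=0.3, b=0.5, c=0.0)     # selection on X only
no_x_arrow = slope_in_selected_sample(s=0.3, b=0.0, c=0.4)     # selection on Y only
print(uncorrelated, no_y_arrow, no_x_arrow)
```

A spurious negative slope appears even when the population slope is zero; selection that depends on X alone leaves the slope intact, while selection that depends on Y alone does not.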

Selection bias is symptomatic of a general phenomenon associated with conditioning
on collider nodes ($Z$ in our example). The phenomenon involves spurious
associations induced between two causes upon observing their common effect, since
any information refuting one cause should make the other more probable. It has
been known as Berkson’s paradox (Berkson, 1946), “explaining away” (Kim and Pearl,
1983) or simply “collider bias” (see footnote 3).

3.5  The  M-bias

The M-bias is another instance of Berkson’s paradox where the conditioning variable,
$Z$, is a pre-treatment covariate, as depicted in Fig. 7. The parameters $\gamma_1$ and $\gamma_2$
represent error covariances $C_{XZ}$ and $C_{ZY}$, respectively, which can be generated, for
example, by latent variables affecting each of these pairs.

Figure 7: [path diagram: $X \to Y$ with coefficient $\alpha$; double arrows $Z \leftrightarrow X$ ($\gamma_1$) and $Z \leftrightarrow Y$ ($\gamma_2$)]

To analyze the size of this bias, we apply Eq. (10) and get:

$$\beta_{yx\cdot z} = \frac{\alpha - (\gamma_2 + \alpha\gamma_1)\gamma_1}{1 - \gamma_1^2} = \alpha - \frac{\gamma_1\gamma_2}{1 - \gamma_1^2} \tag{21}$$

Thus, the bias induced increases substantially when $\gamma_1$ approaches one, that is, when
$Z$ becomes a good predictor of $X$. Ironically, this is precisely when investigators have
all the textbook reasons to adjust for $Z$. Being pre-treatment, the collider $Z$ cannot
be distinguished from a confounder (as in Fig. 1(a)) by any statistical means, which has
led some statisticians to conclude that “there is no reason to avoid adjustment
for a variable describing subjects before treatment” (Rosenbaum, 2002, p. 76).
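
How quickly the M-bias grows can be seen by sweeping the error covariance between Z and X. The sketch assumes unit variances and the structure described in the text (a direct effect of X on Y, and Z correlated with X and with Y only through error covariances); `m_bias` and the values used are hypothetical.

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

def m_bias(alpha, g1, g2):
    """Bias of the Z-adjusted slope; g1 = cov(Z, U_X), g2 = cov(Z, U_Y)."""
    s_yx = alpha              # the path through Z is blocked by the collider
    s_yz = g2 + alpha * g1    # Z <-> Y, plus Z <-> X -> Y
    s_zx = g1                 # Z <-> X
    return partial_slope(s_yx, s_yz, s_zx) - alpha

biases = [m_bias(0.3, g1, 0.4) for g1 in (0.2, 0.5, 0.8)]
print([round(b, 4) for b in biases])
```

Without adjustment the slope is unbiased in this model; the adjusted slope drifts further from the true effect as Z becomes a better predictor of X.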

3.6  Reverse  Regression 

Is it possible that men would earn a higher salary than equally qualified women and,
simultaneously, that men are more qualified than women doing equally paying jobs? This
counter-intuitive condition can indeed exist, and has given rise to a controversy called
“Reverse Regression”; some sociologists argued that, in salary discrimination cases,
we should not compare salaries of equally qualified men and women, but, rather,
compare qualifications of equally paid men and women (Goldberger, 1984).

Footnote 3: It has come to my attention recently, and I feel a responsibility to make it public, that seasoned
reviewers for highly reputable journals reject papers because they are not convinced that such bias
can be created; it defies, so they claim, everything they have learned from statistics and economics.
A typical resistance to accepting Berkson’s Paradox is articulated in (Rubin, 2009) (Pearl, 2009b).

The phenomenon can be demonstrated in Fig. 8. Let $X$ stand for gender (or
age, or socioeconomic background), $Y$ for job earnings and $Z$ for qualification. The
partial regression $\beta_{yx\cdot z}$ encodes the differential earning of males ($X = 1$) over females
($X = 0$) having the same qualifications ($Z = z$), while $\beta_{zx\cdot y}$ encodes the differential
qualification of males ($X = 1$) over females ($X = 0$) earning the same salary ($Y = y$).

Figure 8: [path diagram: $X \to Y$ with coefficient $\alpha$; $X \to Z$ with coefficient $\gamma$; $Z \to Y$ with coefficient $\beta$]

For the model in Fig. 8, we have

$$\beta_{yx\cdot z} = \alpha$$

$$\beta_{zx\cdot y} = \frac{\sigma_{zx} - \sigma_{zy}\sigma_{yx}}{1 - \sigma_{yx}^2} = \frac{\gamma - (\beta + \gamma\alpha)(\alpha + \gamma\beta)}{1 - (\alpha + \gamma\beta)^2}$$

Surely, for any $\alpha > 0$ and $\beta > 0$ we can choose $\gamma$ so as to make $\beta_{zx\cdot y}$ negative. For
example, the combination $\alpha = \beta = 0.6$ and $\gamma = 0.1$ (values small enough to be compatible with unit variances, which require $\alpha^2 + \beta^2 + 2\alpha\beta\gamma \le 1$) yields

$$\beta_{zx\cdot y} = \frac{0.1 - (0.6 + 0.1 \times 0.6)(0.6 + 0.1 \times 0.6)}{1 - (0.6 + 0.1 \times 0.6)^2} = \frac{-0.3356}{0.5644} \approx -0.59$$

Thus, there is no contradiction in finding men earning a higher salary than equally
qualified women, and simultaneously, men being more qualified than women doing
equally paying jobs. A negative $\beta_{zx\cdot y}$ may be a natural consequence of a male-favoring
hiring policy ($\alpha > 0$), a male-favoring training policy ($\gamma > 0$) and
qualification-dependent earnings ($\beta > 0$).
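
The direct and reverse regressions can be computed from the same three covariances by swapping the roles of Y and Z in Eq. (10). This is a sketch, not the paper's own computation: the parameter values 0.6, 0.6 and 0.1 are an assumption made here so that the implied error variance of Y stays non-negative when all variables have unit variance, and `partial_slope` is an illustrative helper.

```python
def partial_slope(s_ab, s_ac, s_cb):
    """Slope of A on B given C (Eq. 10 with relabelled roles), unit variances."""
    return (s_ab - s_ac * s_cb) / (1.0 - s_cb ** 2)

# Hypothetical standardized coefficients: X -> Y, Z -> Y, X -> Z.
alpha, beta, gamma = 0.6, 0.6, 0.1

s_yx = alpha + beta * gamma     # X -> Y and X -> Z -> Y
s_zx = gamma                    # X -> Z
s_zy = beta + gamma * alpha     # Z -> Y and Z <- X -> Y

direct = partial_slope(s_yx, s_zy, s_zx)    # Y on X given Z
reverse = partial_slope(s_zx, s_zy, s_yx)   # Z on X given Y
print(round(direct, 4), round(reverse, 4))
```

The direct regression recovers the structural coefficient on the arrow from X to Y, while the reverse regression has the opposite sign even though all three structural coefficients are positive.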

The question of whether standard or reverse regression is more appropriate for
proving discrimination is also clear. The equality $\beta_{yx\cdot z} = \alpha$ leaves no room for hesitation,
because $\alpha$ coincides with the counterfactual definition of “direct effect of
gender on hiring had qualification been the same,” which is the court’s definition of
discrimination.

The reason the reverse regression appeals to intuition is that it reflects a model
in which the employer decides on the qualification needed for a job on the basis of
both its salary level and the applicant’s sex. If this were a plausible model, it would
indeed be appropriate to prosecute an employer who demands higher qualifications
from men as opposed to women. But such a model should place $Z$ as a post-salary
variable, e.g., $X \to Z \leftarrow Y$.

3.7  Bias  Amplification 

In the model of Fig. 9, $Z$ acts as an instrumental variable, since $\sigma_{zu} = 0$. If $U$ is
unobserved, however, $Z$ cannot be distinguished from a confounder, as in Fig. 1(a),
in the sense that for every set of parameters $(\alpha, \beta, \gamma)$ in Fig. 1(a) one can find a set
$(a, b, c, c_0)$ for the model in Fig. 9 such that the observed covariance matrices of the
two models are the same. This indistinguishability, together with the fact that $Z$
may be a strong predictor of $X$, may lure investigators to condition on $Z$ to obtain an
unbiased estimate of $c_0$ (Hirano and Imbens, 2001). Recent work has shown, however,
that such adjustment would amplify the bias created by $U$ (Bhattacharya and Vogt,
2007; Pearl, 2010a; Wooldridge, 2009).

The magnitude of this bias and its relation to the pre-conditioning bias, $ab$, can
be computed from the diagram of Fig. 9, as follows:

Figure 9: [path diagram: $Z \to X$ with coefficient $c$; $X \to Y$ with coefficient $c_0$; latent $U$ affecting both $X$ and $Y$, with coefficients $a$ and $b$]

$$\beta_{yx\cdot z} = \frac{\sigma_{xy} - \sigma_{yz}\sigma_{zx}}{1 - \sigma_{zx}^2} = \frac{(ab + c_0) - c\,c_0\,c}{1 - c^2} = c_0 + \frac{ab}{1 - c^2} \tag{22}$$

We see that the bias created, $ab/(1 - c^2)$, is proportional to the pre-existing bias $ab$ and
increases with $c$; the better $Z$ predicts $X$, the higher the bias. An intuitive explanation
of this phenomenon is given in Pearl (2010a).
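
The amplification can be traced numerically by sweeping the strength of the instrument. The sketch assumes unit variances and the structure described in the text: Z affects X only, a latent U affects both X and Y, and X affects Y. The name `c0` for the direct effect of X on Y, the function name, and all values are choices made for this illustration.

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

def bias_after_adjusting(c0, a, b, c):
    """Bias of the Z-adjusted slope; a*b is the confounding through U."""
    s_yx = c0 + a * b     # X -> Y, plus X <- U -> Y
    s_yz = c * c0         # Z -> X -> Y (Z -> X <- U is blocked)
    s_zx = c              # Z -> X
    return partial_slope(s_yx, s_yz, s_zx) - c0

unadjusted_bias = 0.5 * 0.4
adjusted = [bias_after_adjusting(0.4, 0.5, 0.4, c) for c in (0.0, 0.5, 0.8)]
print(unadjusted_bias, [round(v, 4) for v in adjusted])
```

With a useless instrument the adjusted bias equals the unadjusted one; it grows as the instrument becomes a stronger predictor of X.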


3.8  Near  Instruments  -  amplifiers  or  attenuators? 


Figure 10: [path diagram: as in Fig. 9, with an additional arrow $Z \to Y$ with coefficient $d$]

The model in Fig. 10 is indistinguishable from that of Fig. 9 when $U$ is unobserved.
However, here $Z$ acts both as an instrument and as a confounder. Conditioning on $Z$
is beneficial in blocking the confounding path $X \leftarrow Z \to Y$ and harmful in amplifying
the baseline bias $cd + ab$. The trade-off between these two tendencies can be quantified
by computing $\beta_{yx\cdot z}$, yielding

$$\beta_{yx\cdot z} = \frac{\sigma_{xy} - \sigma_{yz}\sigma_{zx}}{1 - \sigma_{zx}^2} = \frac{c_0 + cd + ab - (d + c\,c_0)c}{1 - c^2} = \frac{c_0(1 - c^2) + ab}{1 - c^2} = c_0 + \frac{ab}{1 - c^2} \tag{23}$$

We see that the baseline bias $ab + cd$ is first reduced to $ab$ and then magnified
by the factor $(1 - c^2)^{-1}$. For $Z$ to be a bias-reducer, its effect on $Y$ (i.e., $d$) must
exceed its effect on $X$ (i.e., $c$) by a factor $ab/(1 - c^2)$. This trade-off was assessed
by simulations in (Myers et al., 2011) and analytically in (Pearl, 2011), including an
analysis of multi-confounders and nonlinear models.
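
The trade-off can be made concrete by comparing the bias before and after adjustment for two strengths of the arrow from Z to Y. The sketch assumes unit variances and positive coefficients; `c0` denotes the direct effect of X on Y, and the function names and numbers are hypothetical.

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

def biases(c0, a, b, c, d):
    """Return (bias without adjustment, bias after adjusting for Z)."""
    before = a * b + c * d
    after = partial_slope(c0 + c * d + a * b, d + c * c0, c) - c0
    return before, after

def threshold_d(a, b, c):
    """Smallest d (for positive parameters) at which adjusting for Z helps."""
    return c * a * b / (1.0 - c ** 2)

strong = biases(c0=0.4, a=0.5, b=0.4, c=0.5, d=0.30)   # Z strongly affects Y
weak = biases(c0=0.4, a=0.5, b=0.4, c=0.5, d=0.05)     # Z barely affects Y
print(strong, weak, threshold_d(0.5, 0.4, 0.5))
```

Above the threshold, adjustment lowers the bias; below it, adjustment makes matters worse, since the amplification outweighs the confounding removed.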


3.9  The  Butterfly 

Another model in which conditioning on $Z$ may have both harmful and beneficial
effects is seen in Fig. 11. Here, $Z$ is both a collider and a confounder. Conditioning
on $Z$ blocks the confounding path through $\alpha_1$ and $\alpha_2$ and, at the same time, induces
a virtual confounding path through the latent variables that create the covariances
$C_{XZ} = \gamma_1$ and $C_{ZY} = \gamma_2$.

Figure 11: [path diagram: $X \to Y$ with coefficient $\beta$; $Z \to X$ ($\alpha_1$) and $Z \to Y$ ($\alpha_2$); double arrows $Z \leftrightarrow X$ ($\gamma_1$) and $Z \leftrightarrow Y$ ($\gamma_2$)]

This trade-off can be evaluated from our path-tracing formula Eq. (10), which
yields

$$\begin{aligned}
\beta_{yx\cdot z} &= \frac{\rho_{yx} - \rho_{yz}\rho_{zx}}{1 - \rho_{zx}^2} = \frac{[\beta + (\alpha_1 + \gamma_1)\alpha_2 + \alpha_1\gamma_2] - [\alpha_2 + \gamma_2 + \beta(\gamma_1 + \alpha_1)]\,[\gamma_1 + \alpha_1]}{1 - (\alpha_1 + \gamma_1)^2} \\
&= \frac{\beta - \gamma_2\gamma_1 - \beta(\gamma_1 + \alpha_1)^2}{1 - (\alpha_1 + \gamma_1)^2}
\end{aligned} \tag{24}$$

We first note that the pre-conditioning bias

$$\beta_{yx} - \beta = \alpha_2(\alpha_1 + \gamma_1) + \alpha_1\gamma_2 \tag{25}$$

may have positive or negative values even when both $\sigma_{xz} = 0$ and $\sigma_{zy} = 0$. This refutes
folklore wisdom, according to which a variable $Z$ can be exonerated from confounding
considerations if it is uncorrelated with both treatment ($X$) and outcome ($Y$).

Second, we notice that conditioning on $Z$ may either increase or decrease bias,
depending on the structural parameters. This can be seen by comparing (25) with
the post-conditioning bias:

$$\beta_{yx\cdot z} - \beta = -\gamma_1\gamma_2 / [1 - (\alpha_1 + \gamma_1)^2] \tag{26}$$

In particular, since Eq. (26) is independent of $\alpha_2$, it is easy to choose values of
$\alpha_2$ that make (25) either higher or lower than (26).
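
The last remark can be illustrated by holding every parameter fixed except the coefficient on the arrow from Z to Y. The sketch assumes unit variances; the argument names `a1`, `a2`, `g1`, `g2` stand for the two arrows out of Z and the two error covariances, and the values are hypothetical.

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

def butterfly_biases(beta, a1, a2, g1, g2):
    """Return (bias before conditioning on Z, bias after conditioning on Z)."""
    s_zx = a1 + g1                        # Z -> X and Z <-> X
    s_yx = beta + s_zx * a2 + a1 * g2     # direct effect plus open paths via Z
    s_yz = a2 + g2 + beta * s_zx          # Z -> Y, Z <-> Y, and paths through X
    before = s_yx - beta
    after = partial_slope(s_yx, s_yz, s_zx) - beta
    return before, after

helps = butterfly_biases(beta=0.3, a1=0.4, a2=0.0, g1=0.2, g2=0.3)
hurts = butterfly_biases(beta=0.3, a1=0.4, a2=-0.2, g1=0.2, g2=0.3)
print(helps, hurts)
```

The post-conditioning bias is the same in both runs, while the pre-conditioning bias is larger in magnitude in the first and essentially zero in the second, so adjustment helps in one case and hurts in the other.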

3.10  Measurement  error 


Figure 12: [path diagrams (a) and (b): $U$ confounds $X$ and $Y$; in (a) $Z$ is a proxy of $U$, in (b) $Z$ is a driver of $U$]

Assume the confounder $U$ in Fig. 12(a) is unobserved but we can measure a proxy
$Z$ of $U$. Can we assess the amount of bias introduced by adjusting for $Z$ instead of
$U$? The answer, again, can be extracted from our path-tracing formula, which yields

$$\begin{aligned}
\beta_{yx\cdot z} &= \frac{\sigma_{yx} - \sigma_{yz}\sigma_{zx}}{1 - \sigma_{zx}^2} = \frac{(\alpha + \beta\gamma) - (\gamma\delta + \alpha\beta\delta)\beta\delta}{1 - \beta^2\delta^2} \\
&= \frac{\alpha + \beta\gamma - \beta\delta^2(\gamma + \alpha\beta)}{1 - \beta^2\delta^2} = \frac{\alpha(1 - \beta^2\delta^2) + \gamma\beta(1 - \delta^2)}{1 - \beta^2\delta^2} \\
&= \alpha + \frac{\gamma\beta(1 - \delta^2)}{1 - \beta^2\delta^2}
\end{aligned} \tag{27}$$

As expected, the bias vanishes when $\delta$ approaches unity, indicating a faithful
proxy. Moreover, if $\delta$ can be estimated from an external pilot study, the causal effect
$\alpha$ can be identified. (See Pearl, 2010b; Kuroki and Pearl, 2013.) Remarkably, identical
behavior emerges in the model of Fig. 12(b) in which $Z$ is a driver of $U$, rather than
a proxy.

The same treatment can be applied to errors in measurements of $X$ or of $Y$ and,
in each case, the formula of $\sigma_{xy\cdot z}$ reveals which model parameters affect
the resulting bias.
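
The residual bias from adjusting for a noisy proxy, and one way the effect might be recovered when the proxy's fidelity is known, can be sketched as follows. The sketch assumes unit variances, a latent U with arrows into X, Y and Z, and a known coefficient `delta` on the arrow from U to Z; `recover_alpha` is a hypothetical helper that rescales the proxy covariances to those of U before applying Eq. (10), and is not claimed to be the procedure of the cited papers.

```python
def partial_slope(s_yx, s_yz, s_zx):
    """Eq. (10): slope of Y on X given Z, for unit-variance variables."""
    return (s_yx - s_yz * s_zx) / (1.0 - s_zx ** 2)

def proxy_covariances(alpha, beta, gamma, delta):
    """Covariances for X -> Y (alpha), U -> X (beta), U -> Y (gamma), U -> Z (delta)."""
    s_yx = alpha + beta * gamma
    s_yz = delta * (gamma + alpha * beta)
    s_zx = beta * delta
    return s_yx, s_yz, s_zx

def recover_alpha(s_yx, s_yz, s_zx, delta):
    """Rescale proxy covariances to those of U (assumes delta is known)."""
    return partial_slope(s_yx, s_yz / delta, s_zx / delta)

cov_half = proxy_covariances(0.3, 0.5, 0.4, delta=0.5)
bias_noisy = partial_slope(*cov_half) - 0.3
bias_faithful = partial_slope(*proxy_covariances(0.3, 0.5, 0.4, delta=1.0)) - 0.3
alpha_hat = recover_alpha(*cov_half, delta=0.5)
print(round(bias_noisy, 4), round(bias_faithful, 4), round(alpha_hat, 4))
```

With a half-faithful proxy a sizeable bias remains; it vanishes for a perfect proxy, and dividing the two proxy covariances by the known fidelity restores the true coefficient in this model.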

4  Conclusions


We have demonstrated how path-analytic techniques can illuminate the emergence
of several phenomena in causal analysis and how these phenomena depend on the
structural features of the model. Although the techniques are limited to linear analysis,
hence restricted to homogeneous populations with no interactions, they can be
superior to simulation studies whenever conceptual understanding is of the essence, and
problem size is manageable.

Appendix  I

In linear systems, the explanation for the equality $\sigma_{yx} = \sigma_{yx\cdot w}$ in Fig. 3 is simple.
Conditioning on $W$ does not physically constrain $Z$; it merely limits the variance of
$Z$ in the subpopulation satisfying $W = w$ which was chosen for observation. Given
that effect-homogeneity prevails in linear models, we know that the effect of $X$ on
$Z$ remains invariant to the level $w$ chosen for observation and, therefore, this
$w$-specific effect reflects the effect of $X$ on the entire population. This dictates (in a
confounding-free model) $\beta_{yx\cdot w} = \beta_{yx}$.

But how can we explain the persistence of this phenomenon in nonparametric
models, where we know (e.g., using do-calculus) that adjustment for $W$ does not have
any effect on the resulting estimand? In other words, the equality

$$E[Y \mid X = x] = E_w E[Y \mid X = x, W = w]$$

will hold in the model of Fig. 3 even when the structural equations are non-linear.
Indeed, the independence of $W$ and $X$ implies

$$\begin{aligned}
E[Y \mid X = x] &= \sum_w E[Y \mid X = x, W = w]\,P(W = w \mid X = x) \\
&= \sum_w E[Y \mid X = x, W = w]\,P(W = w) \\
&= E_w E[Y \mid X = x, W = w]
\end{aligned}$$

The answer is that adjustment for $W$ involves averaging over $W$; conditioning on
$W$ does not. In other words, whereas the effect of $X$ on $Z$ may vary across strata of
$W$, the average of this effect is none other than the effect over the entire population,
i.e., $E[Y \mid do(X = x)]$, which equals $E[Y \mid X = x]$ in the non-confounding case.

Symbolically, we have

$$\begin{aligned}
E[Y \mid do(X = x)] &= \sum_w E[Y \mid do(X = x), W = w]\,P[W = w \mid do(X = x)] \\
&= \sum_w E[Y \mid do(X = x), W = w]\,P(W = w) \\
&= \sum_w E[Y \mid X = x, W = w]\,P(W = w) \\
&= E(Y \mid X = x)
\end{aligned}$$

The first reduction is licensed by the fact that $X$ has no effect on $W$ and the second
by the back-door condition.